Statistics & Probabilities

  • Intro
  • Descriptive Statistics
  • Summary Statistics
  • Probability Distributions

Let’s start by importing some useful packages

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import math
import scipy
import scipy.stats as stats

0️⃣ Introduction

Quoting Wikipedia:

Statistics is the discipline that concerns the collection, organization, displaying, analysis, interpretation and presentation of data.

Statistics gives the possibility to quantify uncertainty 🤔

Statisticians can make categorical statements 💪 with assurance of their level of uncertainty 👌

Statistical Work

  • Data Analysis: gathering, display and summarize data
  • Probability: laws of chance, in or out of the casino
  • Inference: drawing statistical conclusions from specific data, using probability

1️⃣ Descriptive Statistics

  • How can we discover underlying patterns in a heap of numbers?
  • How can we represent data in useful ways?
  • How can we summarize the data?

First step: gather some data 🕵️‍♀️

Experiment: ask students in a University class to give their weight (in pounds).

Male (57)

140 145 160 190 155 165 150 190 195 138 160 155 153 145 170 175 175 170 180 
135 170 157 130 185 190 155 170 155 215 150 145 155 155 150 155 150 180 160
135 160 130 155 150 148 155 150 140 180 190 145 150 164 140 142 136 123 155

Female (35)

140 120 130 138 121 116 125 145 150 112 125 130 120 130 131 120 118 125 135
125 118 122 115 102 115 150 110 116 108  95 125 133 110 150 108

We can convert this raw data into a DataFrame:

male_df = pd.DataFrame([140, 145, 160, 190, 155, 165, 150, 190, 195, 138, 160, 155, 153, 145, 170, 175, 175, 170, 180, 135, 170, 157, 130, 185, 190, 155, 170, 155, 215, 150, 145, 155, 155, 150, 155, 150, 180, 160, 135, 160, 130, 155, 150, 148, 155, 150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155],
    columns=['weight'])
male_df['sex'] = 'male'
female_df = pd.DataFrame([140, 120, 130, 138, 121, 116, 125, 145, 150, 112, 125, 130, 120, 130, 131, 120, 118, 125, 135, 125, 118, 122, 115, 102, 115, 150, 110, 116, 108, 95, 125, 133, 110, 150, 108],
    columns=['weight'])
female_df['sex'] = 'female'

weights_df = pd.concat([male_df, female_df], ignore_index=True)
weights_df.sample(5)
    weight     sex
32     155    male
0      140    male
64     145  female
54     136    male
31     155    male

We can now plot the data. For every number between 95 and 215, plot a bar chart counting the number of people for a given weight.

ax = sns.histplot(weights_df["weight"], bins=len(weights_df))
/home/ahmed/.local/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
ax.set_xlabel("Weight (in lbs)")
plt.show()

🤔 What’s the name of this graph?

Histogram

A histogram is an accurate representation of the distribution of numerical data.

It is an estimate of the probability distribution of a continuous variable.

⚠️ Histogram (continuous univariate) ≠ Bar chart (categorical + numerical)

Cumulative plots

Alternatively, we can plot the count of weights inferior to a certain value.

Instead of the counts, we can also plot the density (sums up to 1) or percentage (sums up to 100).

sns.histplot(weights_df["weight"], binwidth=5, ax=ax2, kde=True,cumulative=True, stat="percent");
/home/ahmed/.local/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
plt.show()

2️⃣ Summary statistics

🎯 Goal: communicate the largest amount of information about the distribution of data in the simplest possible way.

Typical summary statistics include measures of:

  • Location / central tendency (e.g. mean)
  • Statistical Dispersion / spread (e.g. variance)
  • Shape of the distribution (e.g. skewness & kurtosis)
  • Linear correlation of two variables X and Y

Population vs Sample

Mean

The mean of a population of N elements is defined by:

\[ {\mu} = {\frac {1}{N}}\sum {i=1}^{N}x{i}={\frac {x_{1}+x_{2}+\cdots +x_{N}}{N}} \]

The mean of a sample of the population \[ {\bar {x}} = {\frac {1}{n}}\sum _{i=1}^{n}x_{i}={\frac {x_{1}+x_{2}+\cdots +x_{n}}{n}} \]

Median

The median is the value separating the higher half from the lower half of a data sample

Odd number of values:

1 3 3 6 7 8 9

Even number of values

1 2 3 4 5 6 8 9

Then median is 4.5

Mean vs Median

Median is robust against outliers.

Mode

The mode is the value that appears most often

One mode

1 3 6 6 6 6 7 7 12 12 17

The mode is 6 and it is unique.

Bimodal dataset

1 1 2 4 4

There are two modes: 1 and 4

Skewness

👉 Skewness

Statistical dispersion

Dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed.

Examples:

  • Variance \(\sigma^2\)

  • Standard deviation \(\sigma\)

  • Interquatile Range \(IQR\)

  • etc.

The variance of a population of \(N\) elements is: \[ {\displaystyle \sigma^2={{\frac {1}{N}}\sum _{i=1}^{N}(x_{i}-{\mu})^{2}}} \]

The standard deviation of a population of \(N\) elements is the square root of the variance:

\[ \sigma = {\sqrt {{\frac {1}{N}}\sum _{i=1}^{N}(x_{i}-{\mu})^{2}}} \]

For a sample of a population, a good estimator of \(\sigma\) is the sample standard deviation :

\[ s = \sqrt {{\frac {1}{n-1}}\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}} \]

🤔 \({\frac{1}{n}}\) would give an underestimate of the true population variance (Bessel’s correction)

Interquartile range (IQR)

It’s the difference between upper and lower quartiles. \(\text{IQR} = Q_3 - Q_1\)

💡 It can be used to identify outliers in the data set: they are defined as observations that fall below \(Q_1\) - 1.5 IQR or above \(Q_3\) + 1.5 IQR. 📈 IQR is very useful for boxplots!

But my whiskers are not symmetrical!

Common Summary Statistics

Summary statistics of a sample are the five following numbers:

  • min
  • max
  • mean
  • lower quartile (25%)
  • median (50%)
  • upper quartile (75%)

weights_df['weight'].describe()
count     92.000000
mean     145.152174
std       23.739398
min       95.000000
25%      125.000000
50%      145.000000
75%      155.500000
max      215.000000
Name: weight, dtype: float64

⚠️ Beware of the Datasaurus 🦕

Summary statistics are not enough, you need to conduct Exploratory data analysis too!

Correlation between 2 variables

The linear correlation between \(X\) and \(Y\) is also called the Pearson’s coefficient

\[ r = Corr(X,Y) = {{\frac {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{n \sigma_{x} \sigma_{y}}}} \]

Correlation vs. Independance?

(X,Y) independent⇒Corr(X,Y) = 0

Corr(X,Y) = 0⇏(X,Y) independent (as \(r\) only captures linear dependance)

️Probability Distributions

We assign a probability measure \(P(A)\) to an event \(A\). This is a value between 0 and 1 that shows how likely the event is.

A probability distribution is a mathematical function that describes the probability of different possible values of a variable.

Examples

Take the returns of a financial time series:

# Generate series from start of 2016 to end of 2020
fig = sns.kdeplot(df.returns, fill=True, color="r")
/home/ahmed/.local/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
plt.show()

Well-Known distributions

  • Normal Distribution
  • Exponential Distribution
  • Uniform Distribution
  • Binomial Distribution

Normal Distribution

Most important distribution due to the central limit theorem: every variable that can be modelled as a sum of many small independent, identically distributed variables with finite mean and variance is approximately normal.

It is entirely described by just two parameters: \(\mu, \sigma\)

\[{\mathcal N(\mu, \sigma)} \]

import math 
from scipy import stats

def plot_normal_distribution(mu, variance):
    sigma = math.sqrt(variance)
    x = np.linspace(-10, 10, 100)
    plt.plot(x, stats.norm.pdf(x, mu, sigma), label=f"μ={mu}, σ²={variance}")

plot_normal_distribution(0, 1)
plot_normal_distribution(1, 2)
plot_normal_distribution(-3, 5)
plt.legend()
plt.show()

PDF & CDF

PDF

A probability density function (PDF) tells us the probability that a random variable takes on a certain value.

CDF

A cumulative distribution function (CDF) tells us the probability that a random variable takes on a value less than or equal to x.

Generate a normal distribution

import scipy.stats as spn
spn.norm(loc=100, scale=12)
#where loc is the mean and scale is the std dev
<scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7ff5e0181a20>

Find P LESS than

#To find the probability that the variable has a value LESS than or equal
#let's say 113, you'd use CDF cumulative Density Function
spn.norm.cdf(113,100,12)
0.8606697525503779

Find P GREATER than

#To find the probability that the variable has a value GREATER than or
#equal to let's say 125, you'd use SF Survival Function 
spn.norm.sf(125,100,12)
0.018610425189886332

Given P find the value

#To find the variate for which the probability is given, let's say the 
#value which needed to provide a 98% probability, you'd use the 
#PPF Percent Point Function
spn.norm.ppf(.98,100,12)
124.64498692758187

Fit data to a distribution

from scipy import stats
import numpy as np
from matplotlib import pyplot as plt

lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)

x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.show()

Get P within a range [0.0, 2.0]

# probability 
lower, upper = 0.0, 2.0
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)

Central Limit Theorem 💪

When independent random variables \(X_1…X_n\) with common probability distribution (of mean \(\mu\) and std \(\sigma\)) are added:

  • Their mean converges towards a normal distribution as the number of sample \(n\) increases

  • Centered on the common mean \(\mu\)

  • of standard deviation \({\frac{\sigma}{\sqrt n}}\)

This holds true whatever the form of the common distribution is.

Z-score

If \(x\) is an observation derived from a random variable \(X(μ,σ)\)

\[z={x-\mu \over \sigma }\]

\(z\) = value of \(x\) expressed in number of std above/below the mean

3️⃣ Probabilities

Probability

We assign a probability measure \(P(A)\) to an event \(A\).

This is a value between 0 and 1 that shows how likely the event is.

Sets

Probability theory uses the language of sets. A set is a collection of some items (elements).

Example: A={♣,♢} You can perform operations on sets and visualize them with Venn diagrams:

  • Union
  • Intersection
  • Complement
  • Substraction
  • Partition

Random experiment

  • A random experiment is a process by which we observe something uncertain

  • After the experiment, the result of the random experiment is known: it is the (outcome)

  • The set of all possible outcomes is called the sample space \(S\)

Examples:

  • Toss a coin. \(S=𝐻,𝑇\)

  • Roll a die. \(S=1,2,3,4,5,6\)

  • Observe the number of goals in a soccer match. \(S=0,1,2,3,…\)

When we repeat a random experiment several times, we call each one of them a trial.

Sample space \(S\) is defined based on how you define the random experiment…

❓ If the experiment is Toss a coin three times, what’s the sample space?

S={(H,H,H),(H,H,T),(H,T,H),(T,H,H),(H,T,T),(T,H,T),(T,T,H),(T,T,T)}

☝️ The goal is to assign a probability to events, defined as subsets of a sample space \(S\).

Conditional Probability

Likelihood of an event occurring given that another event has already occurred.

It quantifies the probability of one event happening under the condition that we know a related event has taken place.

The conditional probability of event A occurring given that event B has occurred is denoted as P(A|B) and is calculated as:

\[P(A|B) = P(A ∩ B) / P(B)\]

where:

  • \(P(A|B)\) is the conditional probability of event A given event B.
  • \(P(A ∩ B)\) is the probability that both event A and event B occur simultaneously (the intersection of A and B).
  • \(P(B)\) is the probability of event B occurring.

In simpler terms, it answers the question: “What is the probability that event A happens if we know that event B has already occurred?”

Bayes’ Theorem

Bayes’ Theorem is a formula used in probability to update our beliefs about an event based on new evidence. It helps us find the probability of an event happening, given the probability of another related event.

\[P(A\mid B)={\frac {P(B\mid A) \cdot P(A)}{P(B)}}\] Where:

  • \(P(A|B)\) is the probability of event A happening, given that event B has occurred.
  • \(P(B|A)\) is the probability of event B happening, given that event A has occurred.
  • \(P(A)\) is the probability of event A happening.
  • \(P(B)\) is the probability of event B happening.

Bayes’ Theorem allows us to update our belief in event A (posterior probability) based on the observed evidence from event B.

Example

Let’s consider a medical example to understand how Bayes’ Theorem works:

Suppose there is a rare disease that affects 1% of the population. A medical test has been developed to detect this disease, and the test is 95% accurate. This means that the test will correctly identify a person with the disease 95% of the time (true positive) and will correctly identify a healthy person 95% of the time (true negative).

However, there is also a chance of false positives and false negatives. Specifically, the test incorrectly identifies a healthy person as having the disease 5% of the time (false positive), and it incorrectly identifies a person with the disease as healthy 5% of the time (false negative).

Question: If a randomly selected person tests positive for the disease, what is the probability that they actually have the disease?

Given information:

  • P(Disease) = 0.01 (probability of having the disease)
  • P(Positive|Disease) = 0.95 (probability of testing positive given the person has the disease)
  • P(Positive|No Disease) = 0.05 (probability of testing positive given the person does not have the disease)

Solution

P(Disease|Positive) = (P(Positive|Disease) * P(Disease)) / P(Positive)

def bayes_theorem(p_disease, p_positive_given_disease, p_positive_given_no_disease):
    p_no_disease = 1 - p_disease
    p_positive = (p_positive_given_disease * p_disease) + (p_positive_given_no_disease * p_no_disease)
    p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
    return p_disease_given_positive

# Given information
p_disease = 0.01
p_positive_given_disease = 0.95
p_positive_given_no_disease = 0.05

# Calculate the probability of having the disease given a positive test result
probability_disease_given_positive = bayes_theorem(p_disease, p_positive_given_disease, p_positive_given_no_disease)

# Convert probability to percentage
percentage_disease_given_positive = probability_disease_given_positive * 100

print(f"The probability of having the disease given a positive test result is approximately {percentage_disease_given_positive:.2f}%.")
The probability of having the disease given a positive test result is approximately 16.10%.